--------------------------------------------------------------------------------
README: RAN RAF v1.0
--------------------------------------------------------------------------------
1. About READ AFRIKAANS NORMAL (RAN) / READ AFRIKAANS FAST (RAF) v1.0
2. Using and citing
3. Development process
4. Directory structure
--------------------------------------------------------------------------------
1. About RAN RAF v1.0

The corpus contains speech from 127 mother-tongue speakers of Afrikaans. It is intended to become a primary national language resource for phonetic researchers and for developers of acoustic models and lexicons for automatic speech recognition (ASR) in Afrikaans. It was created in close collaboration with ELIS-UGent (Belgium).
--------------------------------------------------------------------------------
2. Using and citing

Licence: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa.
Attribute work to URL:	
	http://humanities.nwu.ac.za/ctext

When using the data, please cite:

Wissing, D. 2018. ISLRN: ##
--------------------------------------------------------------------------------
3. Development process 

Speakers (127 in total, both female and male) were sampled according to race and region so as to obtain a rich palette of accents. For reasons of availability, mainly university students were recruited, coming from all regions of South Africa. Although care was taken to ensure broad coverage, no specific details on the speakers are available.

All recordings were made with professional equipment in a sound-treated studio at the local university. Signals were stored directly on the hard disk of a computer as 16 kHz, 16-bit, single-channel PCM (little-endian, signed), i.e. 256 kb/s.
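
The stated audio format can be checked on any file of the corpus with a short script. The following is a minimal sketch using Python's standard `wave` module; the function names `describe_wav` and `matches_readme` are hypothetical helpers, not part of the corpus distribution.

```python
import wave

# Expected values taken from this README: 1 channel, 16 bit, 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Return basic PCM parameters of a WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        sample_rate = wav.getframerate()   # frames per second
        n_frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_width_bytes": sample_width,
        "sample_rate_hz": sample_rate,
        # 16000 Hz * 16 bit * 1 channel = 256000 bit/s, i.e. 256 kb/s
        "bit_rate_bps": sample_rate * sample_width * 8 * channels,
        "duration_s": n_frames / sample_rate,
    }


def matches_readme(info):
    """True if the file has the channel count, bit depth and rate stated above."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

Calling `describe_wav` on a corpus file and passing the result to `matches_readme` should return `True` if the file follows the format described here.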

Every speaker was asked to read a text fragment from a book or a newspaper (about 20-40 seconds of speech per speaker) at a normal speed (RAN), followed by reading a shorter portion of the same text (about 15 seconds) at a faster speed (RAF).
The audio was transcribed at the orthographic level and segmented. All segments containing mispronunciations were removed before the additional levels (described below) were added.

Duration and samples of recordings: 

TOTAL DURATION: 131 minutes.

RAN:
4002 seconds

RAF:
3904 seconds
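
As a quick consistency check, the two per-condition durations add up to the stated total. A minimal sketch of the arithmetic (the variable names are illustrative only):

```python
# Durations as listed in this README, in seconds.
RAN_SECONDS = 4002
RAF_SECONDS = 3904

total_seconds = RAN_SECONDS + RAF_SECONDS
whole_minutes, remaining_seconds = divmod(total_seconds, 60)

print(f"{total_seconds} s = {whole_minutes} min {remaining_seconds} s")
# prints "7906 s = 131 min 46 s"
```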

Format: wav files with associated transcriptions in textgrid format.
Note: These files are compatible with Praat v6, available from: http://www.fon.hum.uva.nl/praat/

Each recording is transcribed/segmented on 5 levels/tiers:
1. segment (phoneme)
2. syllable
3. word
4. speaker and utterance number, corresponding to the filename
5. text
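
The tiers of a TextGrid file can be listed without Praat. The sketch below assumes the files are saved in Praat's long (verbose) text format, in which each tier carries a `name = "..."` line; that assumption is not confirmed by this README. The tier names in `SAMPLE` are made up for illustration and may differ from the names used in the corpus.

```python
import re

# In Praat's long text format each tier has a line such as:  name = "word"
TIER_NAME = re.compile(r'^\s*name\s*=\s*"(.*)"\s*$', re.MULTILINE)


def tier_names(textgrid_text):
    """Return the tier names found in the text of a long-format TextGrid."""
    return TIER_NAME.findall(textgrid_text)


# Made-up two-tier example; not taken from the corpus.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 1.5
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "segment"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "a"
    item [2]:
        class = "IntervalTier"
        name = "word"
        xmin = 0
        xmax = 1.5
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 1.5
            text = "a"
'''

print(tier_names(SAMPLE))
```

For a file from the corpus, read its contents into a string first (the text encoding of the files is not stated here, so it may need to be specified explicitly) and pass that string to `tier_names`; five names are expected, one per tier.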

Phone set used for the transcription: IPA

The filenames are structured as:
RAF_SPEAKERNUMBER_UTTERANCENUMBER.wav/textgrid
RAN_SPEAKERNUMBER_UTTERANCENUMBER.wav/textgrid
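
The filename convention above can be split into its parts with a regular expression. The following is a minimal sketch: the names `RecordingName` and `parse_name` are hypothetical, and it assumes that the speaker and utterance fields contain no underscores or dots. Because the exact width and padding of the numbers are not specified here, both fields are kept as strings.

```python
import re
from typing import NamedTuple, Optional


class RecordingName(NamedTuple):
    rate: str        # "RAN" (normal speed) or "RAF" (fast speed)
    speaker: str     # speaker number, kept as a string
    utterance: str   # utterance number, kept as a string
    extension: str   # "wav" or "textgrid", lower-cased


# Assumption: fields are separated by single underscores and contain
# neither underscores nor dots; the extension is matched case-insensitively.
PATTERN = re.compile(
    r"^(RAN|RAF)_([^_.]+)_([^_.]+)\.(wav|textgrid)$", re.IGNORECASE
)


def parse_name(filename: str) -> Optional[RecordingName]:
    """Split a corpus filename into its parts, or return None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    rate, speaker, utterance, extension = match.groups()
    return RecordingName(rate.upper(), speaker, utterance, extension.lower())
```

For example, `parse_name("RAN_001_01.wav")` (a made-up filename) yields the rate `"RAN"`, speaker `"001"` and utterance `"01"`, which makes it straightforward to pair each normal-speed recording with the fast-speed recordings of the same speaker.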

All words present in the recordings are also provided as an alphabetic list (in orthographic and SAMPA form; transcribed according to the dictionary form/most popular pronunciation) in the file List.RANRAF.Lexicon.1.0.txt.
--------------------------------------------------------------------------------
